feat: allow inserting subschemas #3041

wjones127 · 2024-10-24T19:17:30Z

Allow inserting a subset of the columns in the schema, provided the missing columns are nullable. Missing columns are filled with null values. This also works with nested fields.

For example:

import lance
import pyarrow as pa

data = [
    {"vec": [1.0, 2.0, 3.0], "metadata": {"x": 1, "y": 2}},
    {"metadata": {"x": 3}},
    {"vec": [2.0, 3.0, 5.0], "metadata": {"y": 4}},
]
table = pa.Table.from_pylist(data)
ds = lance.write_dataset(table, "./demo")
ds.to_table().to_pandas()

               vec               metadata
0  [1.0, 2.0, 3.0]   {'x': 1.0, 'y': 2.0}
1             None  {'x': 3.0, 'y': None}
2  [2.0, 3.0, 5.0]  {'x': None, 'y': 4.0}

new_data = [
    {"metadata": {"y": 6}},
]
new_table = pa.Table.from_pylist(new_data)
ds = lance.write_dataset(new_table, "./demo", mode="append")
ds.to_table().to_pandas()

               vec               metadata
0  [1.0, 2.0, 3.0]   {'x': 1.0, 'y': 2.0}
1             None  {'x': 3.0, 'y': None}
2  [2.0, 3.0, 5.0]  {'x': None, 'y': 4.0}
3             None    {'x': None, 'y': 6}
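The null-filling behaviour shown above can be sketched in plain Python. This is only an illustration of the semantics, not how Lance implements it: the `SCHEMA` layout and the `fill_missing_with_nulls` helper are hypothetical names introduced here, and nullability checks are left out.

```python
from typing import Any, Dict, Optional

# Hypothetical schema description (an assumption for this sketch):
# leaf fields map to None, struct fields map to a nested dict.
SCHEMA: Dict[str, Optional[dict]] = {
    "vec": None,
    "metadata": {"x": None, "y": None},
}


def fill_missing_with_nulls(record: Dict[str, Any], schema: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `record` with every schema field present.

    Fields absent from `record` become None; struct fields that are
    present are filled recursively so missing children also become None.
    """
    filled: Dict[str, Any] = {}
    for name, sub_schema in schema.items():
        value = record.get(name)
        if isinstance(sub_schema, dict) and value is not None:
            filled[name] = fill_missing_with_nulls(value, sub_schema)
        else:
            filled[name] = value
    return filled


row = fill_missing_with_nulls({"metadata": {"y": 6}}, SCHEMA)
print(row)
```

The appended record only supplies `metadata.y`, so both the top-level `vec` column and the nested `metadata.x` child come back as `None`, which corresponds to row 3 in the table above.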

Closes #3016

codecov-commenter · 2024-10-24T22:10:43Z

Codecov Report

Attention: Patch coverage is 93.02987% with 49 lines in your changes missing coverage. Please review.

Project coverage is 77.06%. Comparing base (2d3dd67) to head (c897c09).
Report is 4 commits behind head on main.

Files with missing lines	Patch %	Lines
rust/lance-core/src/datatypes/schema.rs	92.77%	16 Missing and 3 partials ⚠️
rust/lance/src/dataset/fragment.rs	79.76%	16 Missing and 1 partial ⚠️
rust/lance/src/dataset.rs	96.02%	3 Missing and 9 partials ⚠️
rust/lance/src/io/commit.rs	66.66%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3041      +/-   ##
==========================================
- Coverage   77.16%   77.06%   -0.10%     
==========================================
  Files         240      240              
  Lines       79764    80417     +653     
  Branches    79764    80417     +653     
==========================================
+ Hits        61548    61975     +427     
- Misses      15071    15265     +194     
- Partials     3145     3177      +32

Flag	Coverage Δ
unittests	`77.06% <93.02%> (-0.10%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

rust/lance/src/dataset.rs

westonpace

Great work. Thanks for cleaning up all that schema / field comparison logic too!

westonpace · 2024-11-05T21:31:55Z

rust/lance-core/src/datatypes/field.rs

@@ -476,13 +432,13 @@ impl Field {
    ///
    /// If the ids are `[2]`, then this will include the parent `0` and the
    /// child `3`.
-    pub(crate) fn project_by_ids(&self, ids: &[i32]) -> Option<Self> {
+    pub(crate) fn project_by_ids(&self, ids: &[i32], include_all_children: bool) -> Option<Self> {
        let children = self


Super minor nit: I'm guessing the optimizer catches this but it might be faster to only calculate children if we need it...

    pub(crate) fn project_by_ids(&self, ids: &[i32], include_all_children: bool) -> Option<Self> {
        if !ids.contains(&self.id) {
            return None;
        }

wjones127 · 2024-11-07

We actually don't want to early return, because even if a field isn't selected, we want to check if it has children that are.
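To make the point about the early return concrete, here is a minimal Python sketch. It is not the Rust implementation: the `Field` dataclass, both functions, and the assumed meaning of `include_all_children` (a selected field keeps its whole subtree) are simplifications made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Field:
    """Simplified stand-in for a schema field (hypothetical, not Lance's type)."""
    id: int
    name: str
    children: List["Field"] = field(default_factory=list)


def project_by_ids(f: Field, ids: List[int], include_all_children: bool = False) -> Optional[Field]:
    """Keep `f` if it is selected or if any descendant is selected."""
    selected = f.id in ids
    if selected and include_all_children:
        # Assumed semantics: a selected field keeps its whole subtree.
        return f
    # Always recurse, even when `f` itself is not selected.
    projected = (project_by_ids(c, ids, include_all_children) for c in f.children)
    kept = [c for c in projected if c is not None]
    if selected or kept:
        return Field(f.id, f.name, kept)
    return None


def project_by_ids_early_return(f: Field, ids: List[int]) -> Optional[Field]:
    """Variant with the early return: unselected parents hide their children."""
    if f.id not in ids:
        return None
    projected = (project_by_ids_early_return(c, ids) for c in f.children)
    return Field(f.id, f.name, [c for c in projected if c is not None])


metadata = Field(0, "metadata", [Field(1, "x"), Field(2, "y")])
with_recursion = project_by_ids(metadata, [2])
with_early_return = project_by_ids_early_return(metadata, [2])
print(with_recursion)
print(with_early_return)
```

Selecting only the child id `2` keeps `metadata` with the single child `y` in the recursive version, while the early-return variant drops the whole struct because the parent id is not in the list.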

westonpace · 2024-11-05T21:35:40Z

rust/lance/src/dataset/fragment.rs

+        // Check if there are any fields that are not in any data files
+        let field_ids_in_files = opened_files
+            .iter()
+            .flat_map(|r| r.projection().fields_pre_order().map(|f| f.id))
+            .filter(|id| *id >= 0)
+            .collect::<HashSet<_>>();
+        let mut missing_fields = projection.field_ids();
+        missing_fields.retain(|f| !field_ids_in_files.contains(f) && *f >= 0);
+        if !missing_fields.is_empty() {
+            let missing_projection = projection.project_by_ids(&missing_fields, true);
+            let null_reader = NullReader::new(Arc::new(missing_projection), opened_files[0].len());
+            opened_files.push(Box::new(null_reader));
+        }


This is neat!
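The idea in the hunk above can be sketched in Python: collect the field ids that some data file provides, subtract them from the requested projection, and serve the remainder as all-null columns. Everything here is a hypothetical simplification (data files modelled as dicts from field id to a list of values, `read_projection` standing in for the reader stack); only the `>= 0` filter mirrors the diff directly.

```python
from typing import Dict, List, Optional

Column = List[Optional[float]]


def missing_field_ids(projection_ids: List[int], files: List[Dict[int, Column]]) -> List[int]:
    """Field ids requested by the projection that no data file contains."""
    in_files = {fid for f in files for fid in f if fid >= 0}
    return [fid for fid in projection_ids if fid >= 0 and fid not in in_files]


def read_projection(projection_ids: List[int], files: List[Dict[int, Column]], num_rows: int) -> Dict[int, Column]:
    """Read requested columns from the files and null-fill the missing ones."""
    columns: Dict[int, Column] = {}
    for f in files:
        for fid, values in f.items():
            if fid in projection_ids:
                columns[fid] = values
    # Plays the role of the null reader: same row count, every value null.
    for fid in missing_field_ids(projection_ids, files):
        columns[fid] = [None] * num_rows
    return columns


# A fragment written without field 1.
fragment_files = [{0: [1.0, 2.0]}, {2: [5.0, 6.0]}]
missing = missing_field_ids([0, 1, 2], fragment_files)
columns = read_projection([0, 1, 2], fragment_files, num_rows=2)
print(missing)
print(columns)
```

Field `1` is absent from both files, so it is the only id reported missing and it is returned as a two-row column of `None` values alongside the columns that were actually stored.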

rust/lance/src/dataset/write.rs

In #2639 we added support for *updating* subcolumns. In #3041 we added support for *inserting* subcolumns. This PR adds support for upserting them (or doing insert-if-not-exists).

Closes #2904

## Example

```python
import pyarrow as pa
import lance

table = pa.table({
    "id": range(3),
    "a": [1.0, 2.0, 3.0],
    "c": ["x", "x", "x"]
})
dataset = lance.write_dataset(table, "example")

# Upsert: when_matched_update_all + when_not_matched_insert_all
new_data = pa.table({
    "id": [2, 3],
    "c": ["y", "y"]
})
(
    dataset
    .merge_insert(on="id")
    .when_matched_update_all()
    .when_not_matched_insert_all()
    .execute(new_data)
)
dataset.to_table().to_pandas()
```

```
   id    a  c
0   0  1.0  x
1   1  2.0  x
2   2  3.0  y
3   3  NaN  y
```

```python
# Insert-if-not-exists: when_not_matched_insert_all
new_data = pa.table({
    "id": [3, 4],
    "c": ["z", "z"]
})
(
    dataset
    .merge_insert(on="id")
    .when_not_matched_insert_all()
    .execute(new_data)
)
dataset.to_table().to_pandas()
```

```
   id    a  c
0   0  1.0  x
1   1  2.0  x
2   2  3.0  y
3   3  NaN  y
4   4  NaN  z
```
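The upsert and insert-if-not-exists semantics with a partial set of columns can be modelled on plain Python rows. This is a rough sketch of the behaviour described above, not the Lance API: the `merge_insert` function and its `update_matched` / `insert_not_matched` flags are invented for illustration, and missing values are represented as `None` rather than NaN.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def merge_insert(rows: List[Row], new_rows: List[Row], on: str, columns: List[str],
                 update_matched: bool = True, insert_not_matched: bool = True) -> List[Row]:
    """Merge `new_rows` into `rows` by key; new rows may omit columns."""
    result = [dict(r) for r in rows]
    index = {r[on]: i for i, r in enumerate(result)}
    for new in new_rows:
        key = new[on]
        if key in index:
            if update_matched:
                # Only the supplied columns change; others keep their values.
                result[index[key]].update(new)
        elif insert_not_matched:
            # Columns the new row omits are filled with None.
            result.append({c: new.get(c) for c in columns})
            index[key] = len(result) - 1
    return result


cols = ["id", "a", "c"]
table = [{"id": i, "a": float(i + 1), "c": "x"} for i in range(3)]

# Upsert: update matched rows and insert unmatched ones.
upserted = merge_insert(table, [{"id": 2, "c": "y"}, {"id": 3, "c": "y"}], "id", cols)

# Insert-if-not-exists: leave matched rows untouched.
inserted = merge_insert(upserted, [{"id": 3, "c": "z"}, {"id": 4, "c": "z"}], "id", cols,
                        update_matched=False)
print(inserted)
```

As in the tables above, row `2` keeps `a = 3.0` while `c` becomes `"y"`, row `3` is left as `"y"` by the insert-if-not-exists pass, and the newly inserted rows have no value for `a`.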

github-actions bot added the enhancement New feature or request label Oct 24, 2024

wjones127 force-pushed the feat/insert-subschema branch from ef8c62d to 24c27e7 Compare October 30, 2024 20:44

github-actions bot added the python label Nov 1, 2024

wjones127 force-pushed the feat/insert-subschema branch from 3c1ac51 to e239a69 Compare November 5, 2024 00:11

wjones127 commented Nov 5, 2024

rust/lance/src/dataset.rs (resolved)

wjones127 marked this pull request as ready for review November 5, 2024 02:59

westonpace approved these changes Nov 5, 2024

wjones127 added 10 commits November 5, 2024 16:03

start to accept subschemas

4f331d4

finish subschema checks

ea20a60

get top-level subschema inserts working

ece60a2

add test for column order and nested subschema

bda2649

get python test passing

c955f82

fix issues in test

272df86

update test

d3c235b

fix nested fields

492d3fa

fix python

f6c27d1

fix for migration test

f916af7

wjones127 force-pushed the feat/insert-subschema branch from c53fb4a to f916af7 Compare November 6, 2024 23:20

wjones127 added 5 commits November 6, 2024 15:35

add initial test

478a523

add failing test for take

c09e67b

fix handling blobs

711f0a0

cleanup

48b6aa3

pr feedback

ce74c63

wjones127 requested a review from westonpace November 7, 2024 18:13

wjones127 added 4 commits November 7, 2024 11:49

revert early return

22e8e8d

try to fix test

e1ec188

validate the dataset

59a70d7

cleanup

c897c09

wjones127 merged commit 6d24d84 into lancedb:main Nov 7, 2024
26 checks passed

wjones127 mentioned this pull request Dec 17, 2024

feat: merge-insert supports inserting subset of columns #3100

Merged
